Recovering from a Disaster in an Exchange Server 2010 Environment : Recovering from a Site Failure

1/8/2011 4:09:12 PM

When a site becomes unavailable because of a physical access limitation or a disaster such as a fire or earthquake, the Exchange server hosted in that site must be recovered elsewhere. Exchange Server has no single-step method for merging information from the failed site's server into another server, so the process involves recovering the lost server in its entirety.

To prepare for the recovery of a failed site, an organization can create redundancy in a failover site. With redundancy built into a remote site, far less recovery and restore work is required if the primary site is lost.

For environments in which SLAs offer little time to bring up a recovery location, administrators should strongly consider implementing Database Availability Groups, a new feature of Exchange Server 2010 that replaces CCR and SCR.

Creating Redundant and Failover Sites

Redundant sites are created for a couple of different reasons. First, a redundant site can have a secondary Internet connection and bridgehead routing server so that if the primary site is down, the secondary site can be the focus for inbound and outbound email communications. This redundancy can be built, configured, and set to automatically provide failover in case of a site failure.

The other reason for a redundant site is to provide geographic failover that allows for transparent disaster recovery. In Exchange Server 2010, you could build a “warm standby” site in which you install Exchange Server 2010 only when it is needed for recovery, but that offers no advantage over a redundant site that already holds replicated mailbox data. This is exactly what Database Availability Groups provide when placed in a site that also has the Client Access Server and Hub Transport Server roles available.

If you plan to utilize redundant DR sites, be sure to update those sites with patches and applications as you apply them to the production systems. This ensures that the remote replicas are usable should you have a failure in the primary location.
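One way to spot a DR server that has fallen behind on updates is to list the version each Exchange server reports, grouped by Active Directory site. The following is a minimal sketch run from the Exchange Management Shell; it assumes that the reported version is a sufficient first check, and update rollups may not be reflected in this value, so installed updates should still be verified on each server.

```powershell
# List every Exchange server with its site, roles, and reported version.
# Servers in the production and DR sites should report matching versions.
Get-ExchangeServer |
    Sort-Object Site, Name |
    Format-Table Name, Site, ServerRole, AdminDisplayVersion -AutoSize
```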

Creating the Failover Site

When an organization decides to plan for site failures as part of a disaster recovery solution, many areas need to be addressed and many options exist. For organizations looking for redundancy, network connectivity is a priority, along with spare servers that can accommodate the user load. The spare servers need to have enough disk space to accommodate a complete restore. As a best practice, to ensure a smooth transition, the following list of recommendations provides a starting point:

Allocate the appropriate hardware devices, including servers with enough processing power and disk space to accommodate the restored machines’ resources.
Host the organization’s external DNS zones and records using primary DNS servers located at an Internet service provider (ISP) collocation facility, or have redundant DNS servers registered for the domain and located at both physical locations.
Publish the recovery site’s mail host as a lower-priority MX record (that is, one with a higher preference number than the primary). This way, when the recovery server comes online, you won’t have to wait for DNS propagation to advertise the new MX record.
Ensure that network connectivity is already established and stable between sites and between each site and the Internet.
Create at least one copy of backup tape medium for each site. One copy should remain at one location, and a second copy should be stored with an offsite data storage company. This is necessary only if recovery of mailbox data beyond the internal retention policies is needed.
Have a copy of all disaster recovery documentation stored at multiple locations and at the offsite data storage company. This provides redundancy if a recovery becomes necessary.
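The MX recommendation in the list above can be checked ahead of time by querying the published records. This is a small sketch that uses the standard nslookup tool from a PowerShell prompt; example.com is a placeholder for the organization's own mail domain. Sending servers try the record with the lowest preference number first, so the recovery site's host should appear with a higher number than the primary.

```powershell
# Query the MX records published for the mail domain (placeholder name).
# Expect the primary site's host with the lowest preference number and
# the recovery site's host with a higher one.
nslookup -type=mx example.com
```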

When the systems are in place in the failover site and configured to support a Database Availability Group, the data will automatically be replicated from the master copy and will be available when needed. Be sure to account for the amount of replication traffic that will be passed over the WAN to the disaster recovery site. Although the log files are compressed, they are still potentially a large source of data. To get an idea of the amount of data that will be replicated, look at the volume of log files generated on the primary server each day. That is the amount of data that will be replicated to each replica. For sites running multiple replicas across WAN connections, this can be a significant volume of data.
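The daily log volume described above can be approximated by totaling the transaction log files written during the last 24 hours. The sketch below is only a rough estimate under stated assumptions: the log folder path and the number of remote copies are hypothetical values to replace, and logs that a backup has already truncated are no longer on disk, so the measurement is best taken just before the next backup runs.

```powershell
# Hypothetical log folder for one mailbox database; substitute the real path.
$logPath = 'D:\ExchangeLogs\MailboxDatabase2010A'
$since   = (Get-Date).AddHours(-24)

# Collect the transaction logs written during the last 24 hours.
$logs = @(Get-ChildItem -Path $logPath -Filter '*.log' |
    Where-Object { $_.LastWriteTime -ge $since })

# Total size in bytes of those logs.
$bytes = ($logs | Measure-Object -Property Length -Sum).Sum

# Each replica receives the full log stream, so multiply by the number
# of copies hosted across the WAN (assumed value).
$remoteCopies = 2
$dailyGB = [math]::Round([double]($bytes * $remoteCopies) / 1GB, 2)

"{0} log files, roughly {1} GB per day across the WAN before compression" -f $logs.Count, $dailyGB
```

Repeating the measurement for each replicated database and on several representative days gives a more reliable figure than a single sample.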

Failing Over Between Sites

When utilizing Database Availability Groups with replicas in a failover site that also has CAS and HT roles available, the process of failing services from the primary site to the DR site is easy:

1.	Launch Exchange Management Console.
2.	Expand Organization Configuration.
3.	Click Mailbox.
4.	Click the Database Management tab.
5.	Right-click the database copy you’d like to activate.
6.	Select Activate Database Copy.
7.	When the wizard launches, specify a mount dial override for the operation if desired, and then click OK.
8.	When the wizard is completed, click Finish.

The same process can be performed entirely from the Exchange Management Shell by following these steps:

1.	Launch Exchange Management Shell.
2.	Type Move-ActiveMailboxDatabase -Identity DBName -ActivateOnServer NewServer. For example, Move-ActiveMailboxDatabase -Identity 'Mailbox Database 2010A' -ActivateOnServer E2010.
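For a planned move, it can help to check the target copy first and to confirm the result afterward. The following sketch reuses the database and server names from the example above; the choice of Lossless as the mount dial override is an assumption made for illustration and should be replaced with whatever setting suits the organization.

```powershell
# Check that the copy on the target server is healthy and that its
# copy and replay queues are short before activating it.
Get-MailboxDatabaseCopyStatus -Identity 'Mailbox Database 2010A\E2010' |
    Format-List Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState

# Activate the copy on E2010. With Lossless, the database mounts only
# if no log files are missing on the target.
Move-ActiveMailboxDatabase -Identity 'Mailbox Database 2010A' -ActivateOnServer E2010 -MountDialOverride Lossless

# List all copies of the database; the copy on E2010 should now show
# a status of Mounted.
Get-MailboxDatabaseCopyStatus -Identity 'Mailbox Database 2010A' |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```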

If, on the other hand, the failover isn’t a planned event, the mailbox databases within a DAG are failed over automatically to another server that holds a healthy copy of the mailbox database, typically the copy with the next-highest priority. The preceding steps would primarily be used for DR testing or to move services so that systems can be patched or upgraded in some manner.
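The priority of each copy is recorded as its activation preference, which can be reviewed and adjusted from the Exchange Management Shell. This is a minimal sketch; DRSERVER is a hypothetical name for the server in the failover site.

```powershell
# Show which servers hold copies of the database and the activation
# preference of each copy (1 is the most preferred).
Get-MailboxDatabase -Identity 'Mailbox Database 2010A' |
    Format-List Name, ActivationPreference

# Make the copy on the DR server (hypothetical name) the second choice.
Set-MailboxDatabaseCopy -Identity 'Mailbox Database 2010A\DRSERVER' -ActivationPreference 2
```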

Failing Back After Site Recovery

When the initial site is back online and available to handle client requests and provide access to data and networking services and applications, it is time to consider failing back the services. This process is greatly improved in Exchange Server 2010 through the use of Database Availability Groups. Unlike SCR, which was used for DR in Exchange Server 2007, there is no need to reestablish the replication relationship. A DAG simply continues to replicate mailbox data to all other replicas. This means that if mailbox master status is moved from ServerA to ServerB, ServerB will replicate to ServerA. If, on the other hand, ServerA were unavailable for an extended period of time, its copy could fall too far out of sync with ServerB and need to be reseeded. Exchange Server 2010 supports the concept of incremental reseeding, so the amount of data that would need to be sent back to ServerA would be significantly less than it would have been in Exchange Server 2007 with SCR.
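A failback along these lines might look like the following sketch, which uses the ServerA and ServerB names from the paragraph above and the example database name used earlier. It assumes that ServerA's copy is checked first and reseeded only if it cannot catch up through normal replication; the reseed commands are left commented out because a full reseed copies the entire database across the WAN.

```powershell
# Copy of the database on the recovered server (names are illustrative).
$copy = 'Mailbox Database 2010A\ServerA'

# Review the state of ServerA's copy now that the site is back online.
Get-MailboxDatabaseCopyStatus -Identity $copy |
    Format-List Name, Status, CopyQueueLength, ReplayQueueLength

# If the copy is failed or too far behind to resynchronize, suspend it
# and reseed it from the active copy on ServerB.
# Suspend-MailboxDatabaseCopy -Identity $copy
# Update-MailboxDatabaseCopy -Identity $copy -DeleteExistingFiles

# Once the copy reports Healthy with short queues, move the active
# database back to ServerA.
Move-ActiveMailboxDatabase -Identity 'Mailbox Database 2010A' -ActivateOnServer ServerA
```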

Questions to consider for failing back are as follows:

Will downtime be necessary to restore databases between the sites?
When is the appropriate time to fail back?
Is the failover site less functional than the preferred site? In other words, are only mission-critical services provided in the failover site, or is it a complete copy of the preferred site?

The answers lie in the complexity of the failed-over environment. If the cutover is simple, there is no reason to wait to fail back.

Providing Alternative Methods of Client Connectivity

Even when a failover site is too expensive to be an option, an organization can still plan for site failures. Other lower-cost options are available but depend on how and where the employees do their work. For example, users who need to access email can often do so without physically being at the site location. Email can be accessed remotely from other terminals or workstations.

The following are some ways to deal with these issues without renting or buying a separate failover site:

Consider renting racks or cages at a local ISP to co-locate servers that can be accessed during a site failure.
Have users dial in from home to a terminal server hosted at an ISP to access Exchange Server.
Set up remote user access using Terminal Services or Outlook Web App at a redundant site so that users can access their email, calendar, and contacts from any location.
Configure Outlook to utilize Outlook Anywhere on “slow” connections. This enables users to connect normally while in the office and to connect over “public” connections should the office be unavailable.
Rent temporary office space, printers, networking equipment, and user workstations with common standard software packages such as Microsoft Office and Microsoft Internet Explorer. You can plan for and execute this option in about one day. If this is an option, be sure to find a computer rental agency and get pricing before a failure occurs and you have no choice but to pay whatever rental rates are offered.